Abstract: Machine learning and data mining rely heavily on large amounts of data to build learning models and make
predictions, so the quality of that data is of prime importance. Many industrial and research databases are plagued by
the problem of missing values. A variety of methods have been developed with great success for dealing with missing
values in data sets with uniform attributes, but real-life data sets contain heterogeneous attributes. This paper gives
an overview of imputation and then discusses the proposed work, i.e., a new setting for missing data imputation
(imputing missing data in data sets with mixed attributes and also in clustered data sets) based on a nonparametric
mixture kernel.
I. Introduction

Data mining, the extraction of hidden predictive information from large databases, is a powerful technology
with great potential to help companies focus on the most important information in their data warehouses. A common problem
in data mining is that of automatically finding outliers or anomalies in a database. An outlier is an observation that is
numerically distant from the rest of the data. Since outliers and anomalies are highly unlikely, they can be indicative of bad
data or malicious behaviour. Bad data in turn produces false outcomes. Examples of bad data include skewed data values
resulting from measurement error, erroneous values resulting from data entry mistakes, and missing values.
Missing data, or missing values, occur when no data value is stored for a variable in the current observation. The common
solutions are either to ignore the missing data, which is called marginalization, or to fill in the missing values, which is
called imputation. Imputed values are treated as just as reliable as the truly observed data, but they are only as good as the
assumptions used to create them.

Techniques for dealing with missing values can be classified into three categories [7], [12]: 1) deletion, 2) learning
without handling of missing values, and 3) missing value imputation. The first technique simply omits the cases with
missing values and uses only the remaining instances to finish the learning assignments [13]. Deletion is of two kinds:
i) listwise (case) deletion and ii) pairwise deletion. The second approach is to learn without handling the missing data,
as in the Bayesian network method, the artificial neural network method [15], and the methods in [10]. Missing
data imputation is a procedure that replaces the missing values with some possible values, as in [11], [12]. A variety of
methods have been developed with great success for dealing with missing values in data sets with uniform attributes (their
independent attributes are all either continuous or discrete).

However, these imputation algorithms cannot be applied to many real data sets, such as equipment maintenance
databases, industrial data sets, and medical databases, because these data sets often have continuous, discrete, and
categorical independent attributes. These heterogeneous data sets are referred to as mixed-attribute data sets, and their
independent attributes are called mixed independent attributes. The approach considered here advocates that a missing
datum be imputed if and only if there are some complete instances in a small neighbourhood of the missing datum;
otherwise, it should not be imputed. Further, a nonparametric iterative estimator is proposed to utilize all the available
observed information, including observed information in incomplete instances with missing values.

In this paper, we present an overview of imputation, discuss the problem of imputing mixed-attribute data sets,
and then show how this problem can be addressed by a nonparametric iterative imputation method for
estimating missing values in mixed-attribute data sets and also in clustered data sets.

II. Imputation Overview 

Missing data imputation is a procedure that replaces the missing values with some possible values. Imputed values are
treated as just as reliable as the truly observed data, but they are only as good as the assumptions used to create them.
There are many types of imputation, among them (i) single imputation, (ii) partial imputation, (iii) multiple imputation,
and (iv) iterative imputation. Previous work has handled the missing values in heterogeneous data sets using a
semiparametric iterative imputation method [15].

This method is inconsistent on some data sets. To avoid this problem, and also to improve efficiency, a
nonparametric approach can be used instead. The proposed work therefore handles the missing values in heterogeneous
data sets, and also in clustered data sets (continuous attributes only), using nonparametric iterative imputation.
III. OBJECTIVE OF OUR WORK

The proposed work brings out a new setting of missing data imputation, i.e., imputing missing data in data sets with
mixed attributes (their independent attributes are of different types, i.e., the data sets consist of both discrete and
continuous attributes), referred to as imputing mixed-attribute data sets in [13]. Although many real applications are in
this setting, there is no estimator designed for imputing data sets with heterogeneous attributes, so imputing
mixed-attribute data sets can be taken as a new problem in missing data imputation. The work first proposes two reliable
estimators, for discrete and continuous missing target values, respectively.

IV. PROPOSED WORK 

The challenging issues include how to measure the relationship between instances (transactions) in a
mixed-attribute data set and how to construct hybrid estimators using the observed data in the data set. To address these
issues, this research proposes a nonparametric iterative imputation method based on a mixture kernel for estimating
missing values in mixed-attribute data sets. A mixture of kernel functions (a linear combination of two single kernel
functions, called a mixture kernel) is designed for the estimator, in which the mixture kernel replaces the single kernel
function of traditional kernel estimators. These estimators are referred to as mixture kernel estimators.

Based on this, two consistent kernel estimators are constructed for discrete and continuous missing target values,
respectively, for mixed-attribute data sets. Further, a mixture-kernel-based iterative estimator is proposed that utilizes
all available observed information, including observed information in incomplete instances (with missing values), to impute
missing values, whereas existing imputation methods use only the observed information in complete instances (without
missing values). To improve the accuracy, cluster-based nonparametric iterative imputation is proposed. Fig. 1 shows the
proposed system architecture. The system initially takes the database with missing values and then identifies whether each
attribute is continuous or discrete. If the attribute is continuous, mean pre-imputation is applied; otherwise, mode
pre-imputation is applied. This is the basic step of the imputation technique. Then, using the pre-imputed data set, a
kernel function is applied separately to each attribute type.

This imputation is a single imputation. The mixture kernel function is obtained by integrating the discrete
and continuous kernel functions. The estimated value is calculated by the standard formulas. Finally, the iterative kernel
estimator is applied separately to the continuous and discrete attributes to get the final values for imputation. These
values are imputed into the data set with missing values to make it a complete data set. To further improve the accuracy,
a clustering algorithm is applied; the clustered data set is then used as the first step of the framework.




Fig. 1. System Architecture for Proposed System 

There are five steps in our proposed system: (i) Data Preparation, (ii) Single Imputation Using Kernel
Function, (iii) Constructing the Estimator and Iterative Imputation, (iv) Pre-Processing the Data Set Using a Clustering
Algorithm, and (v) Performance Analysis.

4.1 Data Preparation 

In this module, the records with missing values are identified in the input heterogeneous data set and grouped
according to the attribute type of the missing values. The mean is calculated for the continuous attributes and the mode
for the discrete attributes, and a basic imputation is carried out with these calculated values.
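
As a minimal sketch of this step, the following Python fragment fills each missing entry with the column mean for continuous attributes and with the column mode for discrete ones. The function name `pre_impute`, the use of `None` to mark a missing value, and the toy data are illustrative assumptions, not part of the proposed system.

```python
from statistics import mean, mode

def pre_impute(rows, continuous_cols):
    """Fill None entries: column mean if the column is continuous, column mode otherwise."""
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        fill = mean(observed) if j in continuous_cols else mode(observed)
        for r in filled:
            if r[j] is None:
                r[j] = fill
    return filled

# Hypothetical data: column 0 is continuous, column 1 is discrete (binary).
data = [
    [1.0, 0],
    [3.0, 1],
    [None, 1],
    [5.0, None],
]
pre_imputed = pre_impute(data, continuous_cols={0})
print(pre_imputed)
```

The pre-imputed values serve only as a starting point; the later kernel-based steps are expected to replace them.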






International Journal of Modern Engineering Research (IJMER) 
www.ijmer.com Vol. 3, Issue. 3, May.-June. 2013 pp-1593-1596 ISSN: 2249-6645 

4.2 Single Imputation Using Kernel Function 

This module applies the kernel functions. After the basic imputation, the kernel function is applied
separately to the discrete and the continuous attributes. The discrete and continuous kernel functions are then
integrated to obtain the mixture kernel function.



4.2.1 Discrete Kernel Function 

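\[
L(X_i^d, x^d, \lambda) =
\begin{cases}
1 & \text{if } X_i^d = x^d, \\
\lambda & \text{if } X_i^d \neq x^d
\end{cases}
\qquad (1)
\]

Where,

$X_i^d$ : discrete variable (attribute)
$\lambda$ : smoothing parameter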

Discrete attributes normally take binary values, i.e., either 0 or 1. The output of this step therefore indicates which
instances share the same value, and these are used for imputing the missing values, taking one attribute as the relation.

4.2.2 Continuous Kernel Function 

\[ K\left(\frac{x - X_i}{h}\right) \qquad (2) \]

$K(\cdot)$ is a Mercer kernel, i.e., a positive definite kernel.

4.2.3 Mixture Kernel Function 

\[ K_{h,\lambda,ix} = K\left(\frac{x - X_i}{h}\right) L(X_i^d, x^d, \lambda) \qquad (3) \]

Where,

$h \to 0$ and $\lambda \to 0$ ($\lambda$ and $h$ are the smoothing parameters for the discrete and continuous kernel functions, respectively),
$K_{h,\lambda,ix}$ : symmetric probability density function
$K((x - X_i)/h)$ : continuous kernel function
$L(X_i^d, x^d, \lambda)$ : discrete kernel function
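
The three kernels of Section 4.2 can be sketched in a few lines of Python. The paper only requires K to be a Mercer kernel, so the Gaussian density used here is an assumption, as are the function names and the example bandwidths `h` and `lam`.

```python
import math

def discrete_kernel(xd_i, xd, lam):
    """Discrete kernel: 1 when the discrete values match, lam otherwise."""
    return 1.0 if xd_i == xd else lam

def continuous_kernel(xc_i, xc, h):
    """Continuous kernel K((x - X_i) / h), with an assumed Gaussian K."""
    u = (xc - xc_i) / h
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def mixture_kernel(xc_i, xd_i, xc, xd, h, lam):
    """Mixture kernel of Eq. (3): product of the continuous and discrete kernels."""
    return continuous_kernel(xc_i, xc, h) * discrete_kernel(xd_i, xd, lam)

# Same continuous value; matching versus non-matching discrete value.
same = mixture_kernel(1.0, 0, 1.0, 0, h=0.5, lam=0.1)
diff = mixture_kernel(1.0, 1, 1.0, 0, h=0.5, lam=0.1)
print(same, diff)
```

An instance whose discrete value differs from the query is not discarded; its weight is scaled down by `lam`, so a small `lam` approaches exact matching on the discrete attribute.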

4.3 Constructing the Estimator and Iterative Imputation 

The estimator is constructed separately for each attribute type. An estimator attempts to approximate an
unknown parameter from the measurements. Using the estimator, the iterative value for each attribute is then calculated
by the corresponding formula. In the iterative method, all the imputed values are used to impute subsequent missing
values, i.e., the (t+1)th (t > 1) iteration of imputation is carried out based on the imputed results of the tth imputation,
until the filled-in values converge, begin to cycle, or satisfy the demands of the users.

The first imputation is normally a single imputation, which cannot provide valid standard confidence intervals.
Running further iterative imputations based on the first imputation is therefore reasonable and necessary for dealing
better with the missing values, since each later iteration is carried out based on the previously imputed results. Here, a
stopping criterion is designed for the nonparametric iterations. With t imputation times, there will be (t-1) chains of
iterations. Note that the first imputation is not considered when assessing convergence, because the final results are
decided mainly by the imputations from the second one onward. Of course, the result of the first imputation always has,
to some extent, an effect on the final results.
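
The iteration scheme can be illustrated with a small Python sketch for one continuous target and one continuous predictor. Several choices here are assumptions made only for illustration: the Gaussian kernel, the mean pre-imputation as the starting point, leaving the instance being imputed out of its own estimate, and the stopping rule (largest change between successive fills below `tol`, or `max_iter` reached). The paper does not specify its stopping criterion in this form.

```python
import math

def kernel_estimate(x, xs, ys, h):
    """Kernel-weighted average of ys at x, with an n**-2 guard in the denominator."""
    n = len(xs)
    w = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs]
    return (sum(wi * yi for wi, yi in zip(w, ys)) / n) / (sum(w) / n + n ** -2)

def iterative_impute(xs, ys, h=1.0, tol=1e-6, max_iter=50):
    """Re-impute the None entries of ys until successive fills change by less than tol."""
    missing = [i for i, y in enumerate(ys) if y is None]
    observed = [y for y in ys if y is not None]
    start = sum(observed) / len(observed)            # mean pre-imputation
    current = [start if y is None else y for y in ys]
    for iteration in range(1, max_iter + 1):
        updated = list(current)
        for i in missing:
            # Use all other instances, including previously imputed ones.
            others_x = xs[:i] + xs[i + 1:]
            others_y = current[:i] + current[i + 1:]
            updated[i] = kernel_estimate(xs[i], others_x, others_y, h)
        change = max((abs(updated[i] - current[i]) for i in missing), default=0.0)
        current = updated
        if change < tol:
            break
    return current, iteration

filled, n_iter = iterative_impute([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, None, 3.0])
print(filled, n_iter)
```

Observed values are never overwritten; only the entries that were missing are updated from one iteration to the next.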

4.3.1 Kernel estimator for Continuous Missing attributes 

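\[
\hat{m}(x) = \frac{n^{-1} \sum_{i=1}^{n} K_{h,\lambda,ix} Y_i}{n^{-1} \sum_{i=1}^{n} K_{h,\lambda,ix} + n^{-2}} \qquad (4)
\]

where,

the term $n^{-2}$ in $\hat{m}(x)$ is only used to avoid the denominator being 0.
$Y_i$ : denotes the ith missing value.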
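
A kernel estimator of this kind can be sketched as a weighted average of observed target values, with weights given by the mixture kernel. The sketch below assumes a Nadaraya-Watson form with an n^{-2} guard added to the denominator and a Gaussian continuous kernel; the function names, data layout, and bandwidth values are hypothetical.

```python
import math

def mixture_weight(xc_i, xd_i, xc, xd, h, lam):
    """Mixture kernel: assumed Gaussian K on the continuous part times the discrete kernel."""
    u = (xc - xc_i) / h
    k = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return k * (1.0 if xd_i == xd else lam)

def estimate_continuous(xc, xd, train, h, lam):
    """train holds (continuous x, discrete x, observed y) triples."""
    n = len(train)
    weights = [mixture_weight(c, d, xc, xd, h, lam) for c, d, _ in train]
    numerator = sum(w * y for w, (_, _, y) in zip(weights, train)) / n
    denominator = sum(weights) / n + n ** -2     # n**-2 keeps the denominator positive
    return numerator / denominator

# Two instances at the same continuous value but in different discrete categories.
train = [(0.0, 0, 10.0), (0.0, 1, 20.0)]
est0 = estimate_continuous(0.0, 0, train, h=1.0, lam=0.1)
est1 = estimate_continuous(0.0, 1, train, h=1.0, lam=0.1)
print(est0, est1)
```

The estimate is pulled toward the target values of instances in the same discrete category as the query. For very small n the guard term is not negligible and shrinks the estimate toward zero, so the toy numbers above are illustrative only.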

4.3.2 Kernel estimator for Discrete Missing attributes 

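When the missing value m(x) is in a discrete attribute, let $D_{m(x)} = \{0, 1, \ldots, c_y - 1\}$ denote the range of
m(x). One could estimate m(x) by

\[
\hat{m}(x) = \arg\max_{y \in D_{m(x)}} \frac{n^{-1} \sum_{i=1}^{n} l(Y_i, y, \lambda) K_{h,\lambda,ix}}{n^{-1} \sum_{i=1}^{n} K_{h,\lambda,ix} + n^{-2}} \qquad (5)
\]

where $l(Y_i, y, \lambda) = 1$ if $y = Y_i$ and $\lambda$ if $y \neq Y_i$.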
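
For a discrete target, the same weighting idea can be sketched as a kernel-weighted vote over the candidate classes. The code below assumes that the estimate is the class with the largest weighted score, that the indicator gives weight 1 to matching classes and `lam` otherwise, and that K is Gaussian; all names and data are hypothetical.

```python
import math

def mixture_weight(xc_i, xd_i, xc, xd, h, lam):
    """Mixture kernel: assumed Gaussian K on the continuous part times the discrete kernel."""
    u = (xc - xc_i) / h
    k = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return k * (1.0 if xd_i == xd else lam)

def estimate_discrete(xc, xd, train, classes, h, lam):
    """train holds (continuous x, discrete x, observed class) triples."""
    n = len(train)
    weights = [mixture_weight(c, d, xc, xd, h, lam) for c, d, _ in train]
    denominator = sum(weights) / n + n ** -2

    def score(y):
        # Each instance votes with weight 1 for its own class and lam for the others.
        votes = sum(w * (1.0 if y_i == y else lam)
                    for w, (_, _, y_i) in zip(weights, train))
        return (votes / n) / denominator

    return max(classes, key=score)

train = [(0.0, 0, 1), (0.1, 0, 1), (5.0, 1, 0)]
predicted = estimate_discrete(0.05, 0, train, classes=[0, 1], h=1.0, lam=0.1)
print(predicted)
```

Because the denominator is the same for every candidate class, it does not change which class wins; it is kept only to mirror the form of the continuous estimator.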
4.3.3 Iterative Kernel Estimator for continuous Missing attributes




Where,

$Y_i^t$ : tth imputation of the ith missing value





(6) 



4.3.4 Iterative Kernel Estimator for discrete Missing attributes 




(7) 



In particular, Y* is the most common class in the discrete target variable, and
$Y_i = 0$, $i = r+1, \ldots, n$.

4.4 Pre-Processing Data set using cluster Algorithm 

Before the data are sent to the data preparation module, clustering takes place to group similar data objects. By
applying the clustering formula, the data set is grouped into two sets with respect to every attribute.
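
One possible reading of this step is a two-cluster k-means applied to a single continuous attribute. The sketch below implements Lloyd's algorithm in one dimension; initialising the two centres at the minimum and maximum value is an assumption, as are the function name and the toy data.

```python
def two_means_1d(values, max_iter=100):
    """Split values into two groups with Lloyd's algorithm; returns (labels, centers)."""
    centers = [min(values), max(values)]     # assumed initialisation: the two extremes
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: each value joins the nearer centre.
        new_labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                      for v in values]
        # Update step: each centre moves to the mean of its members.
        for k in (0, 1):
            members = [v for v, lab in zip(values, new_labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

labels, centers = two_means_1d([1.0, 1.2, 0.8, 10.0, 10.5, 9.5])
print(labels, centers)
```

Imputation would then be run within each group, so that a missing value is estimated only from instances in the same cluster.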

4.5 Performance Analysis 

The values imputed without clustering and with k-means clustering are compared. The performance analysis is
carried out using both methods.



V. CONCLUSION

Imputation is the best solution for handling missing values. Missing data imputation is a procedure that replaces the
missing values with some possible values. However, conventional imputation is not an appropriate solution for discrete and
categorical missing values. A consistent kernel regression has been proposed for imputing missing values in a
mixed-attribute data set, and it uses a data-driven method for bandwidth selection.

Data-driven (i.e., automatic) bandwidth selection procedures are not always guaranteed to produce good results,
owing, for example, to the presence of outliers or the rounding/discretization of continuous data. The nonparametric
estimators are proposed for the case in which data sets have both continuous and discrete independent attributes, and also
for clustered data sets. The approach utilizes all available observed information, including observed information in
incomplete instances (with missing values), to impute missing values, whereas existing imputation methods use only the
observed information in complete instances (without missing values). That is, the work explores a framework for
nonparametric iterative imputation based on mixture kernel estimation in both mixed-attribute data sets and clustered data
sets (continuous attributes only). In future work, this approach could be extended to handle more than one
missing value in a single attribute.